Search Ranking
Problem Statement
Ask clarifying questions about scope, scale, and personalization:
- Scope: general search or specialized search?
- Scale: number of websites? QPS (queries per second)?
- Personalization: logged-in user or not
Metrics
Online Metrics
- Click-through rate
$$\text{Click-through rate} = \frac{\text{Number of clicks}}{\text{Number of impressions}}$$
- Successful session rate
Take dwell time into consideration: the length of time a searcher spends viewing a webpage after clicking a link on a SERP (search engine results page).
$$\text{Session success rate} = \frac{\text{Number of successful sessions}}{\text{Number of total sessions}}$$
- Caveat
Pay attention to zero-click searches: a SERP may answer the searcher's query right at the top, so the absence of a click does not necessarily mean failure.
Time to success: a low number of queries per session indicates a successful search session.
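A minimal sketch of how these two online metrics could be computed from logged sessions; the field names (`dwell_times`) and the 30-second dwell-time threshold are illustrative assumptions, not fixed definitions.

```python
# Sketch: click-through rate and session success rate from hypothetical logs.
# The dwell-time threshold (30 s) is an assumed cutoff for a "successful" click.

def click_through_rate(num_clicks: int, num_impressions: int) -> float:
    return num_clicks / num_impressions if num_impressions else 0.0

def session_success_rate(sessions: list[dict], dwell_threshold_s: float = 30.0) -> float:
    # A session counts as successful if at least one clicked result
    # was viewed longer than the dwell-time threshold.
    successful = sum(
        1 for s in sessions
        if any(t >= dwell_threshold_s for t in s.get("dwell_times", []))
    )
    return successful / len(sessions) if sessions else 0.0

sessions = [
    {"dwell_times": [45.0, 3.0]},   # successful: one long dwell
    {"dwell_times": [2.0]},         # unsuccessful: quick bounce
    {"dwell_times": []},            # zero-click session
]
print(click_through_rate(num_clicks=120, num_impressions=1500))  # 0.08
print(session_success_rate(sessions))                            # ~0.33
```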
Offline Metrics
- Normalized discounted cumulative gain (NDCG)
$$CG_p = \sum_{i=1}^{p}rel_i$$
In the cumulative gain, $rel_i$ is the relevance rating of the document at position $i$.
$$DCG_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i+1)}$$
The discounted cumulative gain penalizes the search engine's ranking if highly relevant documents appear lower in the result list.
Caveat: $DCG$ can’t be used to compare the search engine’s performance across different queries on an absolute scale. (the $DCG$ for a query with a longer result list may be higher due to its length instead of its quality.)
$$NDCG_p = \frac{DCG_p}{IDCG_p}$$ where $IDCG$ is ideal discounted cumulative gain.
NDCG normalizes the DCG into the 0-to-1 range by dividing the DCG of a query by its maximum possible DCG, i.e., the IDCG.
Averaged over the $N$ queries used for evaluation:
$$NDCG = \frac{\sum_{i=1}^{N} NDCG_i}{N}$$
An $NDCG$ value near 1 indicates good performance by the search engine.
- Caveat
$NDCG$ does not penalize irrelevant search results.
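A minimal sketch of the DCG/NDCG formulas above in Python; the relevance ratings are toy values for illustration.

```python
import math

def dcg(relevances: list[float]) -> float:
    # DCG_p = sum(rel_i / log2(i + 1)) with 1-based positions i.
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances: list[float]) -> float:
    # Normalize by the ideal DCG: the same relevances sorted best-first.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Ratings for the top-4 results of one query (higher = more relevant).
ranking = [3, 2, 3, 0]
print(round(ndcg(ranking), 3))  # close to 1 => good ordering

# Averaging over N queries, matching the NDCG formula above.
queries = [[3, 2, 3, 0], [1, 0, 2]]
print(sum(ndcg(q) for q in queries) / len(queries))
```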
Architecture
Document Selection
Document selection is more focused on recall.
- Selection criteria
- Inverted index: an index data structure that stores a mapping from content (e.g., words) to the documents that contain it.
- Query the index to retrieve all documents matching the above selection criteria -> sort the relevant documents according to their relevance score -> forward the top documents to the ranker (see the scoring sketch after this list).
- Relevance scoring scheme
- A weighted linear combination of:
- Terms match: inverse document frequency or IDF score
- Document popularity
- Query intent match
- Personalization match.
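A minimal sketch of this selection step: a toy inverted index for recall-oriented candidate retrieval plus a weighted linear relevance score. The component signals and weights are illustrative placeholders, not tuned values.

```python
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id: str, text: str):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def candidates(self, query: str) -> set[str]:
        # Union of documents containing any query term (recall-oriented).
        result = set()
        for term in query.lower().split():
            result |= self.postings[term]
        return result

def relevance_score(terms_match, popularity, intent_match, personalization,
                    weights=(0.4, 0.3, 0.2, 0.1)) -> float:
    # Weighted linear combination of the selection signals.
    components = (terms_match, popularity, intent_match, personalization)
    return sum(w * c for w, c in zip(weights, components))

index = InvertedIndex()
index.add("doc1", "best pizza place in town")
index.add("doc2", "pizza dough recipe")
print(index.candidates("pizza place"))          # {'doc1', 'doc2'} (order may vary)
print(relevance_score(0.8, 0.5, 1.0, 0.2))      # 0.69
```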
Feature Engineering
- Query-specific features
- Query historical engagement
- Query intent
- Example: a user searches for 'pizza place'; the intent here is 'local'.
- Document-specific features
- Page rank
- Document engagement radius
- Context-specific features
- Time of search
- Recent events
- Searcher-document features
- Distance
- Historical engagement
- Query-document features
- Text match
- Unigram or bigram
- TF-IDF (term frequency-inverse document frequency) match score can also be an important relevance signal for the model.
- Query-document historical engagement, like click rate
- Embeddings: compute a similarity score between the query's vector and the document's vector (see the feature sketch below).
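A minimal sketch of two query-document features, assuming scikit-learn for a TF-IDF match score and cosine similarity over stand-in embedding vectors (randomly generated here; in practice they would come from a pretrained model).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["cheap pizza place near downtown", "italian restaurant with pasta and pizza"]
query = "pizza place"

# TF-IDF match score: cosine similarity between the query and each document.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform([query])
tfidf_match = cosine_similarity(query_vector, doc_vectors)[0]  # one score per doc

# Embedding similarity: assume query_emb / doc_emb come from a pretrained model.
query_emb, doc_emb = np.random.rand(128), np.random.rand(128)
embedding_sim = float(np.dot(query_emb, doc_emb) /
                      (np.linalg.norm(query_emb) * np.linalg.norm(doc_emb)))

print(tfidf_match, embedding_sim)
```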
Training Data Generation
Pointwise approach
- Assign positive/negative samples based on click rate or successful session rate.
- Caveat: fewer negative examples than positives. To remedy this, add random negative examples (see the sketch below).
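A minimal sketch of pointwise label generation with random negative sampling; the log format (`doc_id`, `clicked`) and the number of random negatives are assumptions for illustration.

```python
import random

def pointwise_examples_for_query(query, impressions, all_doc_ids, num_random_negatives=5):
    # Shown results: clicked ones become positives, unclicked ones negatives.
    examples = [(query, imp["doc_id"], 1 if imp["clicked"] else 0) for imp in impressions]
    shown = {imp["doc_id"] for imp in impressions}
    # Random negatives: documents never shown for this query.
    unseen = list(set(all_doc_ids) - shown)
    for doc_id in random.sample(unseen, min(num_random_negatives, len(unseen))):
        examples.append((query, doc_id, 0))
    return examples

impressions = [{"doc_id": "d1", "clicked": True}, {"doc_id": "d2", "clicked": False}]
print(pointwise_examples_for_query("pizza place", impressions, ["d1", "d2", "d3", "d4"]))
```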
Pairwise approach
- The loss function looks at the scores of document pairs as an inequality instead of the score of a single document.
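A minimal sketch of a RankNet-style pairwise objective in PyTorch, assuming the model has already produced scores for the preferred and non-preferred document of each pair; LambdaRank/LambdaMART build on the same pairwise idea with NDCG-based weighting.

```python
import torch

def pairwise_loss(score_preferred: torch.Tensor, score_other: torch.Tensor) -> torch.Tensor:
    # -log(sigmoid(s_pos - s_neg)): small when the preferred document scores higher,
    # i.e., the loss depends only on the score difference within each pair.
    return torch.nn.functional.softplus(-(score_preferred - score_other)).mean()

s_pos = torch.tensor([2.0, 0.5])
s_neg = torch.tensor([1.0, 1.5])
print(pairwise_loss(s_pos, s_neg))  # larger when pairs are mis-ordered
```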
Ranking
A stage-wise approach is used: the first stage focuses on recall (getting the 5-10 relevant documents into the first ~500 results), while the second stage ensures precision of the top 5-10 relevant documents (a sketch of this two-stage setup follows the stage list below).
- Stage 1: pointwise approach
- Simple ML algorithms
- Performance metric: AUC (area under the ROC curve)
- Stage 2: pairwise approach
- LambdaMART
- LambdaRank
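A minimal sketch of the two-stage setup, assuming scikit-learn for the stage-1 pointwise classifier and LightGBM's `LGBMRanker` (lambdarank objective) as a stand-in for LambdaMART in stage 2; the feature matrix, labels, and top-k cutoff are toy placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
import lightgbm as lgb

# Toy data: 100 candidate documents with 8 features each and binary relevance labels.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 8)), rng.integers(0, 2, size=100)

# Stage 1 (pointwise): a simple classifier scores all candidates; keep the top-k.
# Its quality can be checked offline with AUC on held-out data.
stage1 = LogisticRegression().fit(X, y)
scores = stage1.predict_proba(X)[:, 1]
top_k = np.argsort(-scores)[:20]

# Stage 2 (pairwise/listwise): a LambdaMART-style ranker re-ranks the survivors.
# `group` gives the number of candidates per query (one query of 20 docs here).
stage2 = lgb.LGBMRanker(objective="lambdarank", n_estimators=50)
stage2.fit(X[top_k], y[top_k], group=[len(top_k)])
final_order = np.argsort(-stage2.predict(X[top_k]))
```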
Example: Rental Search Ranking
Problem statement:
- Sort results based on the likelihood of booking.
- Build a supervised ML model to predict booking likelihood -> Binary classification model.
Metrics:
- Offline metrics:
- $DCG$, $IDCG$, and $NDCG$
- Online metrics:
- Conversion rate: $\text{conversion rate} = \frac{\text{number of bookings}}{\text{number of search results}}$
Requirements
- Training:
- Imbalanced data and clear-cut sessions
- Train/validation data: split data by time to mimic production traffic (see the split sketch below).
- Inference:
- Serving: low latency
- Under-predicting for new listings: not enough data for the model to estimate booking likelihood for brand-new listings.
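A minimal sketch of the time-based split, assuming a pandas DataFrame of logged sessions; the column names and cutoff date are illustrative assumptions.

```python
import pandas as pd

def time_split(df: pd.DataFrame, cutoff: str):
    # Train on earlier sessions, validate on later ones, so evaluation
    # mimics production traffic instead of leaking future data.
    df = df.sort_values("timestamp")
    train = df[df["timestamp"] < cutoff]
    valid = df[df["timestamp"] >= cutoff]
    return train, valid

logs = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-03", "2024-02-10", "2024-03-01", "2024-03-15"]),
    "label": [1, 0, 0, 1],
})
train, valid = time_split(logs, cutoff="2024-03-01")
```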
Model
- Feature Engineering
- Geolocation (latitude/longitude): raw values are not smooth, so we can use the log of the distance from the map's center.
- Favourite places.
- Train data:
- decide the length of training data by running experiments.
- Model:
- Input: User data, search query, and Listing data.
- Output: binary classification, i.e., booking a rental or not
- Fully connected layers as a baseline, or a more modern architecture (e.g., a variational autoencoder); see the baseline sketch below.
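A minimal sketch of the fully connected baseline in PyTorch: user, query, and listing features are concatenated into one input vector and mapped to a booking probability. The input dimension, layer sizes, and batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BookingClassifier(nn.Module):
    def __init__(self, num_features: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),  # logit for "booked vs. not booked"
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(x)).squeeze(-1)

model = BookingClassifier()
batch = torch.randn(32, 64)  # 32 search results, 64 concatenated features each
labels = torch.randint(0, 2, (32,)).float()
loss = nn.functional.binary_cross_entropy(model(batch), labels)
```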
Calculation & Estimation
- Assumptions: 100 million monthly active users, about 5 bookings per user per year, and roughly 30 rentals viewed before each booking, so the number of observations is $5 \times 30 \times 10^8 = 1.5 \times 10^{10}$ (about 15 billion) per year.
- Data Size:
- $500 \times 1.5 \times 10^{10} = 7.5 \times 10^{12}$ bytes ($\approx 7.5$ TB), where each row takes 500 bytes to store.
- Keep the last 6 months or 1 year of data in the data lake, and archive old data in cold storage.
- Scale: Support $150$ million users.
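A quick script working through these estimates, under the same assumptions (100M monthly active users, 5 bookings per user per year, 30 rentals viewed per booking, 500 bytes per logged row):

```python
# Back-of-envelope estimate of yearly observations and raw storage.
users = 100e6                      # monthly active users
observations = users * 5 * 30      # 1.5e10 rows per year
bytes_total = observations * 500   # 7.5e12 bytes
print(f"{observations:.2e} rows/year, ~{bytes_total / 1e12:.1f} TB/year")
```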
High-level design
- Feature pipeline: Process online features and store key-value feature pairs.
- Feature Store: we need low latency ($<10ms$) during inference; candidate key-value stores include MySQL Cluster, Redis, and DynamoDB (see the sketch after this list).
- To serve millions of requests per second, scale out the Application Servers and use Load Balancers to distribute the load.
- Scale Search/Ranking services.
- Model Store is a distributed storage, like S3.
- Log all ranking candidates as training data; the Search Service sends them to cloud storage or a Kafka cluster.
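A minimal sketch of the feature-store read/write path, assuming redis-py and a Redis instance reachable at localhost:6379; the key and field names are illustrative assumptions.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Feature pipeline side: write precomputed features as a hash per listing.
r.hset("listing:42", mapping={"historical_ctr": "0.12", "review_score": "4.7"})

# Ranking service side: fetch features for a candidate listing before scoring.
features = r.hgetall("listing:42")
print(features)  # {'historical_ctr': '0.12', 'review_score': '4.7'}
```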
Follow-up Questions